Worked Example 1
This example provides a step-by-step guide for analyzing the mlbootcamp5_train.xlsx dataset using Jamovi. The dataset contains health-related variables and aims to explore factors associated with cardiovascular disease. Follow these instructions to understand data preparation, univariate and bivariate descriptive statistics, and multivariate descriptive statistical methods.
mlbootcamp5_train.xlsx
Source: Analyzing cardiovascular data
This is a dataset on cardiovascular disease. All of the dataset values were collected at the moment of medical examination. There are 3 types of input features:
- Objective: factual information;
- Examination: results of medical examination;
- Subjective: information given by the patient.
| Feature | Variable Type | Variable | Value Type |
|---|---|---|---|
| Age | Objective Feature | age | int (days) |
| Height | Objective Feature | height | int (cm) |
| Weight | Objective Feature | weight | float (kg) |
| Gender | Objective Feature | gender | categorical code |
| Systolic blood pressure | Examination Feature | ap_hi | int |
| Diastolic blood pressure | Examination Feature | ap_lo | int |
| Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal |
| Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal |
| Smoking | Subjective Feature | smoke | binary |
| Alcohol intake | Subjective Feature | alco | binary |
| Physical activity | Subjective Feature | active | binary |
| Presence or absence of cardiovascular disease | Target Variable | cardio | binary |
Step 1. Import the Dataset
- Open Jamovi.
- Click on Open → This PC → Browse and load the mlbootcamp5_train.xlsx file.
Step 2. Inspect the Dataset
2.1 Dimensions
To find out how many rows and columns (the “dimensions”) your dataset has in Jamovi, look at the bottom of the data grid in the Data tab (Row count), and in the Variable tab (Variables). You will see the total number of rows (observations) and the total number of columns (variables).
Alternatively, you can use the Descriptives module:
Click on Analyses → Exploration → Descriptives.
Move all variables into the “Variables” box.
The right panel will show details about each variable, and at the bottom, Jamovi typically displays the number of valid rows for each variable (and hence the total).
For the mlbootcamp5_train.xlsx dataset (as provided on Kaggle), you should see:
Number of rows: 70,000
Number of columns: 13
Hence, the dataset’s dimensions are 70,000×13.
2.2 Variable names and descriptions
It’s important to review variable names and their descriptions so that your dataset is clear, readable, and ready for analysis. In Jamovi, you can check and modify variable names and descriptions by following these steps:
Open the Variables tab: Go to the Variables tab.
Rename and add a description: You can change the variable name (Name) to something more descriptive and add/edit the Description to provide more information about what the variable represents.
Below are the original variable names from the mlbootcamp5_train.xlsx and some suggested renaming for clarity, along with brief descriptions:
| Original Name | Suggested Name | Description |
|---|---|---|
| id | ParticipantID | Unique identifier for each participant |
| age | Age_days | Age of the participant in days |
| gender | Gender | Gender of the participant (1 = female, 2 = male) |
| height | Height_cm | Participant’s height in centimeters |
| weight | Weight_kg | Participant’s weight in kilograms |
| ap_hi | BloodPressure_Systolic | Systolic blood pressure (in mmHg) from the participant’s last check-up |
| ap_lo | BloodPressure_Diastolic | Diastolic blood pressure (in mmHg) from the participant’s last check-up |
| cholesterol | CholesterolLevel | Cholesterol level category (1 = normal, 2 = above normal, 3 = well above normal) |
| gluc | GlucoseLevel | Glucose level category (1 = normal, 2 = above normal, 3 = well above normal) |
| smoke | SmokingStatus | Smoking status (0 = non-smoker, 1 = smoker) |
| alco | AlcoholIntake | Alcohol intake (0 = no, 1 = yes) |
| active | PhysicalActivity | Physical activity (0 = no, 1 = yes) |
| cardio | CardiovascularDisease | Whether the participant has been diagnosed with cardiovascular disease (0 = No, 1 = Yes) |
By renaming and describing your variables, you make the dataset more intuitive and easier to analyze or share with others.
Make sure that your variable descriptions also include the logic behind how each measurement or category was determined (e.g., what units were used, how categories were assigned, etc.) so that anyone reviewing the dataset can easily understand its structure and meaning.
Step 3. Variable types
When loading a dataset into Jamovi, it’s important to verify that each variable is assigned the correct Measure type. This ensures your analyses will treat the data appropriately. You can check and modify the measure type by selecting each variable in the Data tab and adjusting the Measure type in the sidebar (e.g., Continuous, Nominal, Ordinal, or ID). Below is a summary of recommended measure types for the variables in the mlbootcamp5_train.xlsx dataset:
| Name | Recommended Measure Type |
|---|---|
| ParticipantID | ID |
| Age_days | Continuous |
| Gender | Nominal |
| Height_cm | Continuous |
| Weight_kg | Continuous |
| BloodPressure_Systolic | Continuous |
| BloodPressure_Diastolic | Continuous |
| CholesterolLevel | Ordinal |
| GlucoseLevel | Ordinal |
| SmokingStatus | Nominal (binary) |
| AlcoholIntake | Nominal (binary) |
| PhysicalActivity | Nominal (binary) |
| CardiovascularDisease | Nominal (binary) |
Remember to double-check that each variable’s measure type accurately reflects the nature of the data (e.g., use Continuous for numeric measurements with a large range, Ordinal for ranked categories, Nominal for labels/categories, and ID for unique identifiers).
Step 4. Levels of categorical variables
4.1 Rename category levels
Renaming category levels in your variables is essential for clarity and interpretability. In Jamovi, you can rename levels by opening the Data tab, selecting the variable of interest, and editing the “Levels” (or “Labels”) in the sidebar. Properly labeled categories help ensure that others (and your future self!) can clearly understand your dataset and analysis outputs.
Below is an example table summarizing possible level renaming for some of the categorical variables:
| Variable | Original Levels | Suggested Level Names |
|---|---|---|
| Gender | 1, 2 | 1 = Female, 2 = Male |
| SmokingStatus | 0, 1 | 0 = Non-smoker, 1 = Smoker |
| AlcoholIntake | 0, 1 | 0 = Non-drinker, 1 = Drinker |
| PhysicalActivity | 0, 1 | 0 = Inactive, 1 = Active |
| CholesterolLevel | 1, 2, 3 | 1 = Normal, 2 = Above Normal, 3 = Well Above Normal |
| GlucoseLevel | 1, 2, 3 | 1 = Normal, 2 = Above Normal, 3 = Well Above Normal |
| CardiovascularDisease | 0, 1 | 0 = No, 1 = Yes |
To rename levels:
Go to the Data tab in Jamovi.
Select the variable whose levels you want to rename.
Edit the labels under “Levels.”
4.2 Reorder category levels
When dealing with ordinal variables in Jamovi, it’s important to ensure that each level is not only properly named, but also placed in the correct order (i.e., from “lowest” or “least” to “highest” or “most”). This ordering allows Jamovi to correctly interpret the progression or ranking within the variable. Below is an example table summarizing how you might order the levels of the ordinal variables:
| Variable | Original Levels | Ordered Levels |
|---|---|---|
| CholesterolLevel | 1, 2, 3 | 1 = Normal → 2 = Above Normal → 3 = Well Above Normal |
| GlucoseLevel | 1, 2, 3 | 1 = Normal → 2 = Above Normal → 3 = Well Above Normal |
How to set the order in Jamovi:
Go to the Data tab.
Select the variable (e.g., cholesterol).
Under Measure type, set it to Ordinal.
Adjust the Levels so they appear in ascending or logical order (e.g., Normal, Above Normal, Well Above Normal).
By setting the appropriate order, you ensure that statistical tests will treat these variables as ordinal rather than nominal, preserving the meaningful ranking in your analysis.
Step 5. Recode variables
Sometimes you need to create new variables or modify existing variables to perform your analyses effectively. In Jamovi, you can compute, transform, or recode variables in a few easy steps using the Compute or Transform option. Below are some examples based on the mlbootcamp5_train.xlsx dataset.
Why Compute or Recode Variables?
Improve Clarity: Turning numeric codes into descriptive categories makes your analysis more understandable.
Enable Proper Analysis: Many statistical models or visualizations require continuous vs. categorical variables or well-defined groups.
Focus on Research Questions: Grouping and transforming data can help isolate the variables most relevant to your analysis goals.
By thoughtfully computing, transforming, and recoding variables, you can tailor your dataset to the exact questions you’re asking—leading to clearer, more insightful results.
5.1 Numeric to numeric (from one to one)
If your dataset stores the “age” variable in days (e.g., Age_days), you can convert it to years in Jamovi by creating a new Computed variable:
\[\text{Age\_years} = \frac{\text{Age\_days}}{365}\]
Converting age from days to years makes it more intuitive for analysis (e.g., categorizing individuals by age groups). This is an example of data transformation, where a new numeric variable is created from a single numeric variable in the dataset.
Steps in Jamovi
Go to the Data tab.
Select the variable Age_days.
At the top, click the Compute icon (the calculator symbol).
In the new compute panel, name your new variable (e.g., Age_years).
In the Formula box, type the expression:
Age_days/365
A new variable (e.g., Age_years) will appear in your dataset, reflecting the participant’s age in years instead of days.
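The same conversion can be sketched outside Jamovi. The snippet below is a minimal illustration in Python, not part of the Jamovi workflow; the function name `days_to_years` is a hypothetical helper, and it uses the same 365-day year as the formula above.

```python
def days_to_years(age_days: float) -> float:
    """Convert an age in days to years, using a 365-day year as in the formula above."""
    return age_days / 365


# Illustrative value only (not taken from the dataset).
print(days_to_years(18250))
```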
5.2 Numeric to numeric (from many to one)
A common calculation is the Body Mass Index (BMI), which you can compute using:
\[\text{BMI} = \frac{\text{weight in kilograms}}{(\text{height in meters})^2}\]
Since height in the dataset is in centimeters, remember to convert it to meters before calculating BMI.
This is an example of a derived numeric variable, which is a variable that is calculated from many other numeric variables in the dataset.
Steps in Jamovi
Go to the Data tab.
Select the variable Weight_kg.
At the top, click the Compute icon (the calculator symbol).
In the new compute panel, name your new variable (e.g., BMI).
In the Formula box, type the expression:
Weight_kg/(Height_cm/100)^2
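As a cross-check of the Jamovi formula, here is a small Python sketch of the same calculation. The function name `bmi` and the example values are illustrative assumptions; the arithmetic follows the BMI formula above, including the centimeter-to-meter conversion.

```python
def bmi(weight_kg: float, height_cm: float) -> float:
    """Body Mass Index: weight in kg divided by squared height in meters."""
    height_m = height_cm / 100  # the dataset stores height in centimeters
    return weight_kg / height_m ** 2


# Illustrative participant: 72 kg and 165 cm.
print(round(bmi(72, 165), 1))
```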
5.3 Numeric to categorical
5.3.1. Combined Indicator
You could also create a combined variable that indicates potential hypertension if both Systolic (BloodPressure_Systolic) and Diastolic (BloodPressure_Diastolic) blood pressure measurements exceed certain thresholds. For instance, you might label a person as High_BP if:
- Systolic \(\geq\) 140 AND Diastolic \(\geq\) 90
Otherwise, label them as Normal.
Steps in Jamovi
Go to the Data tab.
Select the variable BloodPressure_Diastolic.
At the top, click the Compute icon (the calculator symbol).
In the new compute panel, name your new variable (e.g., Hypertension).
In the Formula box, type the expression:
IF(BloodPressure_Systolic >= 140 and BloodPressure_Diastolic >= 90,'High_BP', 'Normal')
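The logic of the IF expression can be sketched in Python as follows. This is an illustration of the rule stated above (both thresholds must be met), with `hypertension_label` as a hypothetical helper name.

```python
def hypertension_label(systolic: float, diastolic: float) -> str:
    """Return 'High_BP' only when BOTH readings meet their thresholds, else 'Normal'."""
    if systolic >= 140 and diastolic >= 90:
        return "High_BP"
    return "Normal"


# Illustrative readings: a high systolic value alone is not enough under this rule.
for reading in [(150, 95), (150, 80), (120, 80)]:
    print(reading, hypertension_label(*reading))
```

Note that a participant with only one elevated reading is labeled Normal under this definition, which is a consequence of using AND rather than OR.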
5.3.2 Recoding a Numeric Variable into Categories
If you want to group participants into age brackets (e.g., “Young,” “Middle-aged,” “Older”), you could recode the Age_years variable into categories. For instance, you can define the following age groups:
0–35 years as “Young”
36–55 years as “Middle-aged”
Over 55 as “Older”
Steps in Jamovi
Go to the Data tab.
Select the variable Age_years.
At the top, click the Transform icon.
Choose the Create New Transform… item from the “using transform” list (Figure 12).
Rename the transform as `Age Group`, then click the Add recode condition button twice.
Fill in the condition boxes as shown in Figure 13.
Fill in the Variable suffix field with the text ..._cat.
Close the Transform and Transformed Variable panes using the arrows.
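The recode rules can be expressed as a short Python sketch. The function name `age_group` is hypothetical, and the handling of fractional ages is an assumption: ages up to and including 35 count as “Young”, ages above 35 and up to 55 as “Middle-aged”, and anything above 55 as “Older”. The exact conditions in Figure 13 may treat the boundaries differently.

```python
def age_group(age_years: float) -> str:
    """Recode a numeric age into three brackets (boundary handling is an assumption)."""
    if age_years <= 35:
        return "Young"
    if age_years <= 55:
        return "Middle-aged"
    return "Older"


# Illustrative ages only.
for age in [29.5, 35, 48, 55, 61.2]:
    print(age, age_group(age))
```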
Step 6. Filter dataset
The mlbootcamp5_train.xlsx dataset contains health and demographic information for 70,000 participants. However, some entries may contain inaccuracies or “dirt” that can skew your analysis. We’ll filter out the following erroneous patient segments:
Diastolic pressure higher than systolic pressure.
Height below the 2.5th percentile or above the 97.5th percentile.
Weight below the 2.5th percentile or above the 97.5th percentile.
After filtering, we’ll calculate the percentage of data removed.
Steps in Jamovi
Calculate Percentiles for Height and Weight
To filter out extreme values, we’ll determine the 2.5th and 97.5th percentiles for both height and weight.
Navigate to Descriptives:
Click on Analyses in the top menu.
Select Exploration and then Descriptives.
Calculate Percentiles for Height and Weight:
Variables: Drag Height_cm and Weight_kg to the Variables box.
Statistics:
In the Statistics section, check Percentile Values.
In the Percentiles field, enter 2.5 and 97.5 to calculate the 2.5th and 97.5th percentiles.
Note the Values: Record the 2.5th and 97.5th percentile values for height and weight from the Results pane (Figure 14).
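To make the idea of a percentile concrete, the sketch below computes one by linear interpolation between sorted values. This is one common definition and is an assumption here; Jamovi may use a slightly different rule, so small differences from its output are possible. The function name `percentile` and the sample values are illustrative.

```python
def percentile(values, p):
    """p-th percentile (0-100) by linear interpolation between sorted values."""
    xs = sorted(values)
    pos = (len(xs) - 1) * p / 100   # fractional position in the sorted list
    lower = int(pos)
    upper = min(lower + 1, len(xs) - 1)
    frac = pos - lower
    return xs[lower] + (xs[upper] - xs[lower]) * frac


# Illustrative heights in cm (not taken from the dataset).
heights = [150, 155, 160, 165, 170, 175, 180]
print(percentile(heights, 2.5), percentile(heights, 97.5))
```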
Compute a new Dummy Variable for erroneous cases
Go to Data tab.
Select the variable Hypertension.
At the top, click the Compute icon (the calculator symbol).
In the new compute panel, name your new variable (e.g., Erroneous_Data).
In the Formula box, type the expression below (Figure 15), where 150 and 180 are the height percentiles and 51 and 108 are the weight percentiles recorded in the previous step:
IF((BloodPressure_Diastolic > BloodPressure_Systolic) or (Height_cm < 150) or (Height_cm > 180) or (Weight_kg < 51) or (Weight_kg > 108),'Erroneous', 'Valid')
Calculate Percentage Removed
Navigate to Descriptives:
Click on Analyses in the top menu.
Select Exploration and then Descriptives.
Frequency Table for Erroneous Data
Variables: Drag Erroneous_Data to the Variables box.
Frequencies: Check the Frequency Table checkbox (Figure 16).
Almost 10% of the dataset will be filtered out.
Filter out erroneous data
Go to Data tab.
At the top, click the Filters icon.
In the Filter 1 box, type the expression (Figure 17):
Erroneous_Data == 'Valid'
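The flag-then-filter logic of this step can be sketched in Python. The thresholds (150, 180, 51, 108) and the row counts (70,000 before and 63,259 after filtering) are the ones used in this worked example; the function names and the three sample rows are hypothetical.

```python
def flag_row(row: dict) -> str:
    """Mirror the Jamovi IF expression: any failed check marks the row 'Erroneous'."""
    bad = (
        row["BloodPressure_Diastolic"] > row["BloodPressure_Systolic"]
        or row["Height_cm"] < 150 or row["Height_cm"] > 180
        or row["Weight_kg"] < 51 or row["Weight_kg"] > 108
    )
    return "Erroneous" if bad else "Valid"


def percent_removed(n_before: int, n_after: int) -> float:
    """Share of rows dropped by the filter, in percent."""
    return 100 * (n_before - n_after) / n_before


# Hypothetical rows for illustration.
rows = [
    {"BloodPressure_Systolic": 120, "BloodPressure_Diastolic": 80, "Height_cm": 165, "Weight_kg": 70},
    {"BloodPressure_Systolic": 80, "BloodPressure_Diastolic": 120, "Height_cm": 165, "Weight_kg": 70},
    {"BloodPressure_Systolic": 130, "BloodPressure_Diastolic": 85, "Height_cm": 190, "Weight_kg": 90},
]
valid_rows = [r for r in rows if flag_row(r) == "Valid"]
print(len(valid_rows), "of", len(rows), "rows kept")

# Row counts reported in this worked example.
print(round(percent_removed(70000, 63259), 1))
```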
Step 7. Univariate Descriptive Statistics
Univariate descriptive statistics involve summarizing and describing the main features of a single variable. These statistics provide simple summaries about the sample and the measures, offering insights into the data’s central tendency, variability, distribution shape, and frequency. Mastering these techniques in Jamovi will equip you with the foundational skills necessary for effective data analysis in psychological research.
7.1 Measures of Central Tendency: Mean, Median, and Mode
Objective
Calculate the central tendency measures—Mean, Median, and Mode—for continuous variables such as Height_cm and Weight_kg.
Steps in Jamovi
Navigate to Descriptives:
- Click on Analyses → Exploration → Descriptives.
Select Variables:
- Drag Height_cm and Weight_kg into the “Variables” box.
Choose Statistics:
- In the “Statistics” section, ensure “Mean”, “Median”, and “Mode” are checked.
Interpreting Results
Mean: The average value, providing a measure of central tendency.
Median: The middle value when data is ordered, less affected by outliers.
Mode: The most frequently occurring value.
Example Interpretation
Height_cm: Mean = 164.49 cm, Median = 165 cm, Mode = 165 cm.
Weight_kg: Mean = 73.53 kg, Median = 72 kg, Mode = 65 kg.
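For readers who want to see the three measures computed by hand, the following Python sketch uses the standard library's `statistics` module on a small made-up sample. The values are illustrative, not drawn from the dataset, and `central_tendency` is a hypothetical helper name.

```python
import statistics


def central_tendency(values):
    """Return (mean, median, mode) of a list of numbers."""
    return statistics.mean(values), statistics.median(values), statistics.mode(values)


# Made-up heights in cm.
sample_heights = [160, 165, 165, 170, 172]
mean, median, mode = central_tendency(sample_heights)
print(mean, median, mode)
```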
7.2 Measures of Variability: Standard Deviation and Variance
Objective
Assess the spread or dispersion of continuous variables using Standard Deviation and Variance.
Steps in Jamovi
Navigate to Descriptives:
- Click on Analyses → Exploration → Descriptives.
Select Variables:
- Drag Height_cm and Weight_kg into the “Variables” box.
Choose Statistics:
- In the “Statistics” section, check “Std. deviation” and “Variance”.
Interpreting Results
Standard Deviation (SD): Indicates the average distance of data points from the mean.
Variance: The square of the standard deviation, representing data dispersion.
Example Interpretation
Height_cm: SD = 6.86 cm, Variance = 47.12 cm².
Weight_kg: SD = 11.91 kg, Variance = 141.95 kg².
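The relationship between the two measures (variance is the squared standard deviation) can be checked with a short Python sketch. It assumes the sample versions of both statistics (dividing by n − 1), which is what `statistics.stdev` and `statistics.variance` compute; the data are made up and `spread` is a hypothetical helper name.

```python
import statistics


def spread(values):
    """Return (sample standard deviation, sample variance)."""
    return statistics.stdev(values), statistics.variance(values)


# Made-up sample for illustration.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
sd, variance = spread(sample)
print(round(sd, 3), round(variance, 3))
```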
7.3 Range and Interquartile Range (IQR)
Objective
Determine the Range and Interquartile Range (IQR) to understand the data’s spread and the middle 50% distribution.
Steps in Jamovi
Navigate to Descriptives:
- Click on Analyses → Exploration → Descriptives.
Select Variables:
- Drag Height_cm and Weight_kg into the “Variables” box.
Choose Statistics:
- In the “Statistics” section, check “Minimum”, “Maximum”, “Range”, and “IQR”.
Interpreting Results
Range: Difference between the maximum and minimum values.
IQR: Difference between the 75th percentile (Q3) and the 25th percentile (Q1), representing the middle 50% of data.
Example Interpretation
Height_cm: Range = 30 cm (150 cm to 180 cm), IQR = 9 cm.
Weight_kg: Range = 57 kg (51 kg to 108 kg), IQR = 16 kg.
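The two definitions can be illustrated with a few lines of Python. The quartile rule used here (`method="inclusive"` in `statistics.quantiles`) is an assumption, and Jamovi's quartiles may differ slightly on small samples. The function name `range_and_iqr` and the data are illustrative.

```python
import statistics


def range_and_iqr(values):
    """Return (range, interquartile range) of a list of numbers."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return max(values) - min(values), q3 - q1


# Made-up sample for illustration.
print(range_and_iqr([1, 2, 3, 4, 5, 6, 7, 8, 9]))
```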
7.4 Frequency Distributions for Categorical Variables
Objective
Summarize categorical variables by calculating the frequency and percentage of each category, such as Gender, SmokingStatus, and CholesterolLevel.
Steps in Jamovi
Navigate to Descriptives:
- Click on Analyses → Exploration → Descriptives.
Select Variables:
- Drag Gender, SmokingStatus, and CholesterolLevel into the “Variables” box.
Choose Statistics:
Ensure “Frequency tables” is checked.
Optionally, check “Bar plot” under “Plots” for visual representation.
Interpreting Results
Frequency Tables: Show the count and percentage of each category within a variable.
Bar plot: Visual representation of the frequency distribution.
Example Interpretation:
Gender:
Female: 41,334 (65%)
Male: 21,925 (35%)
The dataset comprises 65% female and 35% male participants. This indicates a predominance of female participants in the study.
SmokingStatus:
Non-smoker: 57,800 (91%)
Smoker: 5,459 (9%)
A vast majority of participants are non-smokers (91%), while smokers constitute only 9% of the sample.
CholesterolLevel:
Normal: 47,719 (75%)
Above Normal: 8,428 (13%)
Well Above Normal: 7,112 (11%)
The distribution of cholesterol levels shows that:
75% of participants have normal cholesterol levels.
13% have above normal levels.
11% have well above normal levels.
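The percentages in these tables are each count divided by the total number of rows. A minimal Python sketch of that arithmetic, using the Gender counts reported above, is shown here; the function name `percentages` is hypothetical.

```python
def percentages(counts: dict) -> dict:
    """Map each level to (count, percent of total rounded to a whole number)."""
    total = sum(counts.values())
    return {level: (n, round(100 * n / total)) for level, n in counts.items()}


# Counts as reported in the Gender frequency table above.
gender_counts = {"Female": 41334, "Male": 21925}
print(percentages(gender_counts))
```

Because each percentage is rounded separately, the rounded values of a table need not sum to exactly 100, as in the CholesterolLevel table (75 + 13 + 11 = 99).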
7.4 Percentiles and Quartiles
Objective
Identify specific data points within a distribution by calculating Percentiles and Quartiles for continuous variables.
Steps in Jamovi
Navigate to Descriptives:
- Click on Analyses → Exploration → Descriptives.
Select Variables:
- Drag Height_cm and Weight_kg into the “Variables” box.
Choose Statistics:
In the “Statistics” section, check “Percentile Values”.
Check “Cut points for 4 equal groups” for quartiles (25th, 50th, and 75th percentiles).
Check “Percentiles” and enter the desired percentiles, e.g., 2.5 and 97.5 (or 25, 50, 75 for quartiles).
Interpreting Results
Quartiles:
Q1 (25th percentile): 25% of data falls below this value.
Q2 (50th percentile/Median): 50% of data falls below this value.
Q3 (75th percentile): 75% of data falls below this value.
Specific Percentiles: Useful for identifying outliers or specific data points.
Example Interpretation:
Height_cm:
Q1 = 160 cm
25% of participants have a height below 160 cm. This value marks the lower quartile of the height distribution.
Median = 165 cm
The median height is 165 cm, meaning that 50% of the participants are shorter than 165 cm, and 50% are taller.
Q3 = 169 cm
75% of participants have a height below 169 cm. This value represents the upper quartile of the height distribution.
Weight_kg:
2.5th Percentile = 54 kg
2.5% of participants weigh less than 54 kg. This value marks the lower end of the weight distribution.
97.5th Percentile = 100 kg
97.5% of participants weigh less than 100 kg, meaning 2.5% weigh 100 kg or more. This value marks the upper end of the weight distribution.
7.5 Skewness and Kurtosis
Objective
Assess the Skewness (asymmetry) and Kurtosis (tailedness) of continuous variables to understand their distribution shapes.
Steps in Jamovi
Navigate to Descriptives:
- Click on Analyses → Exploration → Descriptives.
Select Variables:
- Drag Height_cm and Weight_kg into the “Variables” box.
Choose Statistics:
- In the “Statistics” section, check “Skewness” and “Kurtosis”.
Interpreting Results
Skewness:
Positive Skew: Tail extends to the right; mean > median.
Negative Skew: Tail extends to the left; mean < median.
Zero Skew: Symmetrical distribution.
Kurtosis:
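High Kurtosis: Data have heavy tails; more outliers.
Low Kurtosis: Data have light tails; fewer outliers.
Mesokurtic: Kurtosis similar to that of a normal distribution (a value near 0 on the scale used in the examples below, where negative values indicate a flatter shape).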
Example Interpretation
Height_cm
Skewness (0.06):
Nearly Symmetrical: A skewness value close to 0 indicates that the distribution of Height_cm is nearly symmetrical.
Minimal Asymmetry: With a skewness of 0.06, there is minimal right skew (slightly longer tail on the right), but it’s practically negligible.
Kurtosis (-0.58):
Platykurtic Distribution: A kurtosis value of -0.58 suggests that the distribution of Height_cm is platykurtic, meaning it is flatter than a normal distribution.
Light Tails: The data has lighter tails, indicating fewer outliers compared to a normal distribution.
Weight_kg
Skewness (0.55):
Moderate Right Skew: A skewness value of 0.55 indicates a moderate positive skew, meaning the distribution of Weight_kg has a longer tail on the right side.
Mass Concentration: More participants have weights below the mean, with fewer individuals having significantly higher weights.
Kurtosis (-0.19):
Slightly Platykurtic: A kurtosis of -0.19 suggests that Weight_kg is slightly platykurtic, meaning it is a bit flatter than a normal distribution.
Light Tails: Similar to Height_cm, there are fewer extreme outliers.
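To show what the two statistics measure, the sketch below computes skewness and excess kurtosis from central moments. This uses the simple moment-based formulas, which is an assumption: Jamovi may apply sample-size corrections, so its values can differ slightly, especially for small samples. The function name and the data are illustrative.

```python
def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis (0 for a normal distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3


# A symmetric sample has zero skewness; a long right tail gives a positive value.
print(skewness_and_excess_kurtosis([1, 2, 3, 4, 5]))
print(skewness_and_excess_kurtosis([1, 1, 2, 2, 10]))
```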
7.6 Visualizing Data: Histograms and Boxplots
Objective
Create visual representations of data distributions to complement numerical descriptive statistics.
Steps in Jamovi
Navigate to Descriptives:
- Click on Analyses → Exploration → Descriptives.
Select Variables:
- Drag Height_cm and Weight_kg into the “Variables” box.
Choose Plots:
- Under the “Plots” section, check “Histogram” and “Box plot”.
Interpreting Results
Histogram:
Displays the frequency distribution of the selected variable.
Shape: Observe the distribution’s skewness, modality, and presence of outliers.
Boxplot:
Visualizes the median, quartiles, and potential outliers.
Insights: Identify data symmetry, skewness, and extreme values.
Example Interpretation
Height_cm - Histogram. The histogram provides a visual representation of the distribution of height (Height_cm) in the dataset. Here’s how to interpret this histogram step by step:
Shape of the Distribution
The histogram appears approximately symmetrical, with the highest bar (peak) around 170 cm.
The shape resembles a bell curve, indicating that the data may follow a normal distribution, though there are slight variations in bar heights.
Central Tendency
The peak (mode) of the distribution occurs near 170 cm, suggesting that this is the most common height range.
Since the histogram is roughly symmetrical, the mean and median lie close to the peak; the statistics in 7.1 give a mean of 164.49 cm and a median of 165 cm.
Spread and Range
- The heights range roughly from 150 cm to 180 cm. Most of the data falls between 160 cm and 170 cm, with fewer individuals at the extremes (150 cm or 180 cm).
Density and Frequency
The y-axis (density) indicates the relative frequency of heights in the dataset:
Taller bars represent intervals with more individuals, such as the bar around 170 cm.
Shorter bars represent intervals with fewer individuals, such as at the edges near 150 cm and 180 cm.
Skewness
- The histogram shows minimal skewness, meaning that the data is nearly symmetrical. There’s no prominent tail on either the left or right, which supports the interpretation of a normal distribution.
Outliers
- The histogram does not show any extreme values or outliers; most of the data is concentrated within a reasonable range of heights.
Weight_kg - Histogram. This histogram provides a visual representation of the distribution of weight (Weight_kg) in the dataset. Here’s how to interpret it step by step:
Shape of the Distribution
- The histogram is asymmetrical, with a longer tail to the right. This indicates a positive skew (right skew), meaning there are relatively few individuals with higher weights compared to the majority of the sample.
Central Tendency
- The peak (mode) of the distribution is around 60–65 kg, suggesting that this is the most common weight range in the dataset.
The mean is likely greater than the mode due to the positive skew, as the higher weights pull the mean toward the right.
The median would fall between the mode and the mean, closer to the bulk of the data.
Spread and Range
- The weights range approximately from 50 kg to 110 kg.
- Most of the data is concentrated between 60 kg and 80 kg, with relatively fewer individuals below 60 kg or above 100 kg.
Density and Frequency
The y-axis (density) represents the relative frequency of individuals within weight intervals:
The highest bars, around 60–65 kg, indicate that a large proportion of participants fall within this weight range.
The shorter bars toward the higher end (above 100 kg) indicate fewer participants in these weight ranges.
Skewness
The histogram exhibits a moderate positive skew, with a longer tail extending to the right.
This suggests that while the majority of individuals fall within a relatively narrow range of weights, there are a few participants with much higher weights.
Outliers
- The right tail suggests the presence of potential outliers among individuals with higher weights (above 100 kg). These outliers could disproportionately affect statistical analyses, particularly those involving means.
Height_cm - Boxplot. This is a boxplot for Height_cm, a graphical representation of the distribution of height in the dataset. Here’s how to interpret it:
Components of the Boxplot
Box:
Represents the interquartile range (IQR), which contains the middle 50% of the data.
The bottom edge of the box is the 25th percentile (Q1), and the top edge is the 75th percentile (Q3).
The line inside the box is the median (Q2), indicating the middle value of the dataset.
Whiskers:
Extend to the smallest and largest values within 1.5 times the IQR from Q1 and Q3.
These represent the spread of the data, excluding potential outliers.
Potential Outliers:
- Points beyond the whiskers (not shown here) would be considered outliers.
Key Observations
Median (Q2):
- The median line is located slightly above the center of the box, which reflects a slightly wider lower half of the box (Q1 = 160 cm, median = 165 cm, Q3 = 169 cm). The difference is small, so the distribution of height is approximately symmetric.
Interquartile Range (IQR):
- The height of the box (distance between Q1 and Q3) reflects the spread of the middle 50% of the data. This range is relatively narrow, indicating moderate variability in participants’ heights.
Whiskers:
The whiskers extend to the minimum (~150 cm) and maximum (~180 cm) heights within 1.5 × IQR.
There are no visible outliers, as no points are shown outside the whiskers.
Distribution Characteristics
Symmetry:
- The box and whiskers are roughly balanced, with the median near the center of the box, suggesting that the distribution of height is close to normal.
Range:
- The total range of the data (from the bottom whisker to the top whisker) is approximately 150 cm to 180 cm.
Central Tendency:
- The median indicates the central value of height, likely close to 165 cm (based on earlier statistics).
Conclusion
- This boxplot for Height_cm shows a symmetrical distribution with no outliers and moderate variability. The data appears to be normally distributed, making it well-suited for parametric statistical analyses. The majority of participants’ heights fall between 160 cm and 170 cm, with a total range from 150 cm to 180 cm.
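The 1.5 × IQR rule behind the whiskers can be checked with a short Python sketch. It takes the height quartiles reported earlier (Q1 = 160 cm, Q3 = 169 cm) and computes the limits beyond which a point would be drawn as an outlier; the function name `outlier_fences` is hypothetical.

```python
def outlier_fences(q1: float, q3: float, k: float = 1.5):
    """Lower and upper limits beyond which values are plotted as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Height quartiles reported in the Percentiles and Quartiles section.
low, high = outlier_fences(160, 169)
print(low, high)
```

With these quartiles the limits are 146.5 cm and 182.5 cm. The filtered heights run from 150 cm to 180 cm, so every value lies inside the limits, which is consistent with the boxplot showing no outliers.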
Weight_kg - Boxplot. This is a boxplot for Weight_kg, showing the distribution of weight in the dataset. Here’s how to interpret it:
Components of the Boxplot
Box:
Represents the interquartile range (IQR), which contains the middle 50% of the data.
The bottom edge of the box corresponds to the 25th percentile (Q1).
The top edge of the box corresponds to the 75th percentile (Q3).
The line inside the box is the median (Q2), indicating the middle value of the dataset.
Whiskers:
Extend to the smallest and largest values within 1.5 times the IQR from Q1 and Q3.
These represent the range of the data, excluding potential outliers.
Outliers:
- Dots above the top whisker indicate outliers, which are values greater than 1.5 × IQR above Q3.
Key Observations
Median (Q2):
- The median is slightly below the center of the box, suggesting that the distribution is slightly right-skewed (positive skew).
Interquartile Range (IQR):
- The height of the box reflects the spread of the middle 50% of the data, with most participants weighing between approximately 65 kg and 80 kg.
Whiskers:
- The whiskers extend downward to approximately 55 kg and upward to approximately 100 kg, indicating the range of most weights without considering outliers.
Outliers:
Several dots above the top whisker indicate higher weights (above 100 kg) that are considered outliers.
These outliers represent participants with significantly higher weights compared to the majority of the dataset.
Distribution Characteristics
Skewness:
- The median’s position below the center of the box and the presence of outliers at the upper end suggest a moderate positive skew, with a longer tail extending toward higher weights.
Variability:
- The IQR (distance between Q1 and Q3) is moderate, indicating that the middle 50% of weights are relatively close in range, but the presence of outliers increases the overall variability.
Outliers:
- The outliers (above 100 kg) may disproportionately affect the mean weight, making the median a better measure of central tendency for this variable.
Implications for Analysis
Normality:
- The positive skew and presence of outliers indicate that the weight distribution may not follow a normal distribution, which could affect statistical analyses requiring normality.
Outliers:
- The outliers may need to be examined further to determine whether they are valid data points, measurement errors, or extreme but meaningful values.
Central Tendency:
- Given the skewness, the median weight is likely more representative of the central tendency than the mean.
Conclusion
- The boxplot for Weight_kg shows a moderately right-skewed distribution with several outliers above 100 kg. Most participants weigh between 65 kg and 80 kg, with a range extending from approximately 55 kg to 100 kg. The skewness and outliers suggest that parametric analyses requiring normality should be applied cautiously, and non-parametric methods or data transformation may be more appropriate.
Step 8. Univariate Inferential Statistics
Univariate hypothesis tests are statistical procedures used to analyze the differences pertaining to a single variable. These tests help determine whether a variable significantly differs from a hypothesized value or distribution.
We will explore various univariate hypothesis tests using the mlbootcamp5_train.xlsx dataset in Jamovi. Each test includes a clear hypothesis, step-by-step instructions for execution in Jamovi, and guidance on interpreting the results.
8.1 One-Sample t-Test: Testing Mean Age
Objective
Determine if the average age of participants differs significantly from 30 years.
Hypotheses
Null Hypothesis (\(H_0\)): The mean age is equal to 30 years.
\(H_0: \mu = 30\)
Alternative Hypothesis (\(H_1\)): The mean age is not equal to 30 years.
\(H_1: \mu \neq 30\)
Steps in Jamovi
Perform One-Sample t-Test:
Navigate to Analyses → T-Tests → One-Sample T-Test.
Variables: Move Age_years to the Variables box.
Test Value: Enter 30.
Options: Check Descriptives and Descriptive plots if desired.
Interpreting Results
Descriptives: Review the mean, standard deviation, and sample size.
t-Test Results:
t-value: Indicates the ratio of the difference between sample mean and test value to the standard error.
Degrees of Freedom (df): Typically \(n-1\).
p-value: If \(p<0.05\), reject the null hypothesis.
Example Interpretation: Since \(p<0.001\), conclude that the average age differs significantly from 30 years.
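The t-value and degrees of freedom described above can be reproduced by hand. The following is a minimal sketch in Python using only the standard library. The `ages` list is a small made-up sample for illustration, not values from the dataset, and `one_sample_t` is a hypothetical helper name.

```python
import math
import statistics


def one_sample_t(values, test_value):
    """Return (t, df) for a one-sample t-test of the mean against test_value."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)   # sample standard deviation (n - 1 denominator)
    se = sd / math.sqrt(n)          # standard error of the mean
    t = (mean - test_value) / se    # difference expressed in standard errors
    return t, n - 1


# Hypothetical ages in years, for illustration only (not taken from the dataset).
ages = [50, 55, 48, 52, 60, 45, 58, 52]
t, df = one_sample_t(ages, 30)
print(f"t = {t:.2f}, df = {df}")
```

Jamovi additionally converts the t-value and degrees of freedom into a p-value using the t-distribution; that step is omitted in this sketch.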
8.2 Binomial test (Proportion Test): Gender Distribution
Objective
Assess whether the gender distribution of participants deviates from an expected 50% Female and 50% Male distribution.
The binomial test determines whether the proportions of the two categories of a binary variable (e.g., gender) deviate from an expected proportion.
Hypotheses
Null Hypothesis (\(H_0\)): Gender distribution is 50% Female and 50% Male.
\(H_0: P(\text{Female})=0.5, P(\text{Male})=0.5\)
Alternative Hypothesis (\(H_1\)): Gender distribution is not 50% Female and 50% Male.
\(H_1: P(\text{Female}) \neq 0.5, P(\text{Male}) \neq 0.5\)
Steps in Jamovi
Navigate to the Binomial Test:
- Go to Analyses → Frequency → 2 Outcomes - Binomial test
Set Up the Test:
Variables: Drag Gender to the Variables box.
Test value: Enter 0.5.
Interpreting Results
Observed Proportions: Displays the proportion of females and males in the sample (e.g., Female = 65%, Male = 35%).
p-value: If \(p < 0.05\), reject the null hypothesis.
Example Interpretation
Hypothesis
- Null Hypothesis (\(H_0\)): The proportion of females and males matches the expected proportion of 50% Female and 50% Male.
- Alternative Hypothesis (\(H_1\)): The proportion of females and males does not match the expected 50%-50% distribution.
Key Results
Counts and Proportions
Female:
Count: 41,334 participants.
Total: 63,259 participants.
Proportion: \(\frac{41334}{63259} = 0.65\) (65% of the total sample are female).
Male:
Count: 21,925 participants.
Total: 63,259 participants.
Proportion: \(\frac{21925}{63259} = 0.35\) (35% of the total sample are male).
These proportions show that the sample contains substantially more females than males; the p-values below indicate whether this difference is statistically significant.
p-Values
Both proportions for Female and Male have a p-value of < 0.0001.
This means that the observed proportions significantly deviate from the expected 50%-50% distribution.
In other words, the null hypothesis (\(H_0\)) is rejected.
Statistical Interpretation
The proportion of females (65%) is significantly higher than the expected 50%.
Similarly, the proportion of males (35%) is significantly lower than the expected 50%.
Since both p-values are < 0.0001, this deviation is highly statistically significant.
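As a rough cross-check of the result above, the counts reported in this example can be plugged into a normal approximation to the binomial test. This sketch is not the exact binomial computation that Jamovi performs. The helper name `proportion_z_test` is hypothetical, and the approximation is reasonable here only because the sample is large.

```python
import math


def proportion_z_test(successes, n, p0=0.5):
    """Two-sided z-test for a proportion (normal approximation to the binomial)."""
    p_hat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)           # standard error under H0
    z = (p_hat - p0) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail area of N(0, 1)
    return p_hat, z, p_value


# Counts reported in the example: 41,334 females out of 63,259 participants.
p_hat, z, p_value = proportion_z_test(41334, 63259)
print(f"proportion = {p_hat:.2f}, z = {z:.1f}, p < 0.0001: {p_value < 0.0001}")
```

A z-value this far from zero corresponds to a p-value far below 0.0001, which agrees with the conclusion drawn from the Jamovi output.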
Practical Implications
Gender Imbalance: The dataset is skewed toward female participants (65%), indicating a notable gender imbalance.
Bias or Representation: This imbalance could reflect:
Sampling bias (e.g., the dataset was collected in a way that over-represented females).
Real-world trends (e.g., more females were willing or available to participate in the study).
Generalizability: The findings from this dataset might be more reflective of the female population and less representative of the male population.
Reporting Results
- A binomial test was conducted to assess whether the gender distribution of participants deviated from an expected 50% Female and 50% Male distribution. The results showed a significant deviation (\(p<0.0001\)), with 65% of participants being female (\(n=41,334\)) and 35% being male (\(n=21,925\)). These findings indicate a significant gender imbalance in the dataset.
8.3 Chi-Square Goodness-of-Fit Test: Cholesterol Level Distribution
Objective
Evaluate whether the distribution of cholesterol levels matches the expected distribution:
Normal: 50%
Above Normal: 30%
Well Above Normal: 20%
Hypotheses
Null Hypothesis (\(H_0\)): The cholesterol level distribution matches the expected proportions.
\(H_0: P(\text{Normal}) = 0.50, P(\text{Above Normal}) = 0.30, P(\text{Well Above Normal}) = 0.20\)
Alternative Hypothesis (\(H_1\)): The distribution does not match the expected proportions.
\(H_1\): At least one proportion differs.
Steps in Jamovi
Navigate to Chi-Square Test:
- Go to Analyses → Frequency → N Outcomes - \(\chi^2\) Goodness of fit
Set Up the Test:
Variables: Drag CholesterolLevel to the Variables box.
Expected Counts: Check the Expected counts checkbox.
Expected Proportions:
Enter 5 for Normal, 3 for Above Normal, and 2 for Well Above Normal.
The test uses expected proportions based on the given ratios:
Normal: Ratio = 5 → Proportion = 0.50 (50% expected)
Above Normal: Ratio = 3 → Proportion = 0.30 (30% expected)
Well Above Normal: Ratio = 2 → Proportion = 0.20 (20% expected)
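The conversion from ratios to expected proportions and counts is simple arithmetic: each ratio is divided by the sum of all ratios, and the resulting proportion is multiplied by the sample size. A minimal sketch in Python, where `expected_from_ratios` is a hypothetical helper and the total of 1,000 is an arbitrary sample size chosen for easy arithmetic:

```python
def expected_from_ratios(ratios, total):
    """Convert ratio weights into expected proportions and expected counts."""
    weight_sum = sum(ratios.values())
    proportions = {level: w / weight_sum for level, w in ratios.items()}
    counts = {level: p * total for level, p in proportions.items()}
    return proportions, counts


ratios = {"Normal": 5, "Above Normal": 3, "Well Above Normal": 2}
# total=1000 is a hypothetical sample size, not the size of the dataset.
proportions, counts = expected_from_ratios(ratios, total=1000)
print(proportions)
print(counts)
```

Because only the relative sizes matter, entering 50, 30, and 20 (or 0.5, 0.3, and 0.2) would yield the same expected proportions in this calculation.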
Interpreting Results
Observed vs. Expected Counts: Compare actual counts to expected counts based on proportions.
Chi-Square Statistic: Higher values indicate greater discrepancy.
p-value: If \(p < 0.05\), reject the null hypothesis.
Example Interpretation
Chi-Square Test Results
Chi-Square Statistic (\(\chi^2\))
- The \(\chi^2\) statistic is 16,474.78. This value measures how much the observed counts deviate from the expected counts.
Degrees of Freedom (\(df\))
- Degrees of freedom (\(df\)): \(k-1=3-1=2\), where \(k\) is the number of categories.
p-Value
- p < 0.0001: The p-value is highly significant, meaning that the observed proportions deviate significantly from the expected proportions.
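To make the three quantities above concrete, the sketch below computes a chi-square goodness-of-fit statistic, its degrees of freedom, and a p-value by hand. The observed counts are hypothetical and chosen only for illustration (they are not the dataset's counts), and `chi_square_gof` is a hypothetical helper name. The p-value shortcut shown applies only when \(df = 2\).

```python
import math


def chi_square_gof(observed, expected_props):
    """Return (chi2, df) for a chi-square goodness-of-fit test."""
    n = sum(observed)
    expected = [p * n for p in expected_props]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, len(observed) - 1


# Hypothetical counts for illustration only (not the dataset's counts).
observed = [750, 130, 120]
chi2, df = chi_square_gof(observed, [0.50, 0.30, 0.20])

# For df = 2 only, the chi-square upper-tail probability is exactly exp(-chi2 / 2).
p_value = math.exp(-chi2 / 2)
print(f"chi2 = {chi2:.2f}, df = {df}, p < 0.0001: {p_value < 0.0001}")
```

Each category contributes its squared deviation from the expected count, scaled by the expected count, so the statistic grows with both the size of the deviations and the sample size.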
Interpretation
The observed proportions (75% Normal, 13% Above Normal, 11% Well Above Normal) significantly differ from the expected proportions (50%, 30%, 20%).
The null hypothesis is rejected (\(p<0.0001\)), indicating that the distribution of cholesterol levels in the dataset does not match the expected distribution.
Practical Implications
Normal Cholesterol Levels: The observed proportion (75%) is significantly higher than the expected proportion (50%), suggesting that a larger percentage of participants have normal cholesterol than anticipated.
Above Normal and Well Above Normal Levels: Both categories are underrepresented compared to the expected proportions (13% vs. 30% and 11% vs. 20%, respectively). This could indicate that fewer participants have elevated cholesterol levels than expected based on the predefined ratios.
Reporting Results
- A chi-square goodness-of-fit test was conducted to determine whether the observed distribution of cholesterol levels deviated from the expected proportions (50% Normal, 30% Above Normal, 20% Well Above Normal). The results were significant (\(\chi^2(2) = 16,474.78, p<0.0001\)), indicating that the observed proportions (75% Normal, 13% Above Normal, 11% Well Above Normal) significantly differed from the expected proportions. This suggests that the sample has a higher proportion of individuals with normal cholesterol and fewer with elevated cholesterol than anticipated.
8.4 Normality of Height
Objective
Assess whether the distribution of participants’ heights is normally distributed.
Normality tests assess whether a variable (in this case, Height_cm) follows a normal distribution. This is important for determining whether parametric tests, which assume normality, can be applied.
This section presents the results of normality tests conducted on the variable Height_cm. Here’s a detailed interpretation of each test and the overall conclusions:
Hypotheses
Null Hypothesis (\(H_0\)): Heights are normally distributed.
\(H_0:\text{Height} \sim \mathcal{N}(\mu, \sigma)\)
Alternative Hypothesis (\(H_1\)): Heights are not normally distributed.
\(H_1:\text{Height} \not\sim \mathcal{N}(\mu, \sigma)\)
Steps in Jamovi
Navigate to Descriptive Statistics:
Go to Analyses → Exploratory → Descriptives.
Variables: Drag Height_cm to the Variables box.
Under Statistics, check Normality - Shapiro-Wilk.
Under Plots, check Q-Q plots.
If the moretests module is not installed, install it.
Navigate to T-Tests:
Go to Analyses → T-Tests → One Sample T-Test.
Variables: Drag Height_cm to the Variables box.
Under Assumptions Checks, check Normality test and Q-Q Plot.
Interpreting Results
Shapiro-Wilk Test
Statistic: Not available (NaN).
Reason: The Shapiro-Wilk test was not calculated because the sample size exceeds 5,000 observations. Shapiro-Wilk is typically used for smaller datasets and is not computed for very large samples due to computational limitations and sensitivity to large sample sizes.
Kolmogorov-Smirnov Test
Statistic: \(D = 0.07\)
p-value: \(p < 0.0001\)
- This indicates a significant result, meaning the distribution of Height_cm deviates from normality.
Anderson-Darling Test
Statistic: \(A = 164.11\)
p-value: \(p < 0.0001\)
- This also shows a significant result, confirming that Height_cm does not follow a normal distribution.
QQ Plot
The QQ (Quantile-Quantile) plot is a graphical method to assess normality by comparing the distribution of the data to a theoretical normal distribution.
Observations from the QQ Plot:
Straight Line: The points closely follow a straight line, particularly in the central portion of the distribution, indicating that the middle range of the data aligns well with a normal distribution.
Deviations at the Tails: Some deviation is visible at the upper and lower tails (ends of the plot). This suggests potential minor deviations from normality, likely due to extreme values or outliers.
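The points on a QQ plot can be computed directly: each sorted observation is paired with the quantile a normal distribution would place at the same rank. The sketch below uses Python's standard library and a small made-up list of heights (not values from the dataset). The helper `qq_points` is hypothetical, and the plotting positions \((i - 0.5)/n\) are one common convention; Jamovi may use a different one or plot standardized values.

```python
import statistics


def qq_points(sample):
    """Pair each sorted observation with its theoretical normal quantile."""
    n = len(sample)
    fitted = statistics.NormalDist(statistics.mean(sample), statistics.stdev(sample))
    ordered = sorted(sample)
    # Plotting positions (i - 0.5) / n keep the probabilities strictly inside (0, 1).
    theoretical = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


# Hypothetical heights in cm, for illustration only (not taken from the dataset).
heights = [150, 158, 160, 162, 164, 165, 166, 168, 171, 185]
for theo, obs in qq_points(heights):
    print(f"theoretical = {theo:6.1f}   observed = {obs}")
```

When the observed values track the theoretical quantiles, the points fall near a straight line; observations that sit far from their theoretical quantile at either end produce the tail deviations described above.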
Normality:
While the Shapiro-Wilk test is not computed due to the large sample size, the QQ plot provides sufficient evidence that the data is approximately normal, with some minor deviations at the tails.
In large datasets, even small deviations from normality can appear statistically significant but may not be practically meaningful.
Interpretation of Results
- Non-Normality: The tests suggest that Height_cm is not perfectly normally distributed. However, the deviation might not be practically significant depending on the shape of the distribution.
- Parametric Tests: Despite the non-normality, parametric tests (e.g., t-tests, ANOVA) are robust to minor deviations from normality, particularly for large sample sizes, due to the Central Limit Theorem.
- Shapiro-Wilk Statistic (W): Measures the degree of normality; values close to 1 indicate that the data are consistent with a normal distribution.
Considerations for Large Sample Sizes
In datasets with large sample sizes:
Normality tests tend to detect even minor deviations from normality as statistically significant.
Visual inspection methods (e.g., histograms, QQ plots) should complement normality tests to assess practical (rather than statistical) normality.
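The effect of sample size can be illustrated by holding the Kolmogorov-Smirnov statistic fixed at the reported value of 0.07 and varying only \(n\). The sketch below uses the asymptotic Kolmogorov formula \(p \approx 2\sum_{k \ge 1} (-1)^{k-1} e^{-2k^2 n D^2}\), which assumes a fully specified reference distribution; Jamovi's implementation may apply a different correction, so the values are indicative only. The helper `ks_asymptotic_p` is hypothetical, and using 63,259 as the sample size for Height_cm is an assumption based on the participant total reported in Section 8.2.

```python
import math


def ks_asymptotic_p(d, n, terms=100):
    """Asymptotic two-sided Kolmogorov-Smirnov p-value for statistic d and sample size n."""
    lam_sq = n * d * d
    total = sum((-1) ** (k - 1) * math.exp(-2 * k * k * lam_sq)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2 * total))


d = 0.07  # KS statistic reported for Height_cm
# 63259 is assumed from the participant total in Section 8.2; 100 and 1000 are hypothetical.
for n in (100, 1000, 63259):
    print(f"n = {n:>6}: p = {ks_asymptotic_p(d, n):.4g}")
```

The same deviation that would be unremarkable in a sample of 100 becomes overwhelmingly significant in a sample of tens of thousands, which is why the QQ plot is weighed alongside the test results here.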
Example Reporting
- Normality tests were conducted to assess whether Height_cm follows a normal distribution. The Kolmogorov-Smirnov Test (\(D = 0.07\)), (\(p < 0.0001\)) and Anderson-Darling Test (\(A = 164.11\)), (\(p < 0.0001\)) both indicated significant deviations from normality. However, given the large sample size, visual inspection of the data is recommended to assess the practical implications of these findings. The QQ plot of the data reveals that the distribution closely follows a normal distribution, with minor deviations at th